Demystifying Memory Models Across the Computing Stack Yatin A. Manerkar, Caroline Trippel, Margaret Martonosi Princeton University ISCA 2019

While you wait:

1) Make sure you've got VirtualBox downloaded to your laptop:

https://www.virtualbox.org/wiki/Downloads

2) Make sure you have Tutorial VM downloaded (or use one of the USB drives):

http://check.cs.princeton.edu/tutorial\_vm/Check\_Tools\_VM\_2019.ova

VM Password: mcmsarefun



http://check.cs.princeton.edu/tutorial.html

### Goals

- Reestablish the basics: Why Memory Consistency Models matter... more than ever!
- Give you concrete tools and techniques for broader MCM research
- Foster a broader community conversant and active in MCM issues
- Show connections outwards to other topics: Security, Distributed Systems, etc.
- Get you thinking about future research possibilities in this area



# **Our Approach Today**

Start from basic knowledge of Memory Consistency Models

- Instruction at level of first-year graduate student
- Will give background info.
- If it's too basic or too fast, say so.
- Variety is the spice of life... Intersperse:
  - Theory
  - Techniques
  - Tool specifics
  - Demos



| Thread 0        | Thread 1                                     |
|-----------------|----------------------------------------------|
| 1x = 1;         | <pre>3if (y == 1) print("Answer is:");</pre> |
| <b>2</b> y = 1; | <pre>4 if (x == 1)</pre>                     |

| Thread 0        | Thread 1                                      |
|-----------------|-----------------------------------------------|
| $1 \times = 1;$ | <pre>3 if (y == 1) print("Answer is:");</pre> |
| 2y = 1;         | <pre>4 if (x == 1) print("42");</pre>         |

Can it print "Answer is: 42"?



| Thread O                | Thread 1                                                                                                                                                                                                |
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $1 \times = 1;$         | <pre>3 if (y == 1) print("Answer is:");</pre>                                                                                                                                                           |
| 2y = 1;                 | <b>4</b> if (x == 1) print("42");                                                                                                                                                                       |
| Con it print "Answer is | $\bullet \mathbf{A} \mathbf{D}^{\prime \prime} \mathbf{D} \mathbf{V} \mathbf{O} \mathbf{C} \mathbf{O} \mathbf{C}^{\prime} \mathbf{O} \mathbf{C} \mathbf{A} \mathbf{D} \mathbf{D} \mathbf{D} \mathbf{A}$ |

Can it print "Answer is: 42"? Yes, eg: 1234



| Thread 0                                                        | Thread 1                                      |
|-----------------------------------------------------------------|-----------------------------------------------|
| $1 \times = 1;$                                                 | <pre>3 if (y == 1) print("Answer is:");</pre> |
| <b>2</b> y = 1;                                                 | <pre>4 if (x == 1)</pre>                      |
| Can it print <b>"Answer is:</b><br>How about just <b>"42"</b> ? | 42"? Yes, eg: 1234                            |



| Thread O                | Thread 1                                     |
|-------------------------|----------------------------------------------|
| 1 x = 1;                | <pre>3if (y == 1) print("Answer is:");</pre> |
| 2y = 1;                 | <b>4</b> if (x == 1) print("42");            |
| Can it print "Answer is | s: 42"? Yes, eg: 1234                        |
| How about just "42"?    | Yes, eg: 1342                                |



| Thread 0                        | Thread 1                                      |
|---------------------------------|-----------------------------------------------|
| $1 \times = 1;$                 | <pre>3 if (y == 1) print("Answer is:");</pre> |
| 2y = 1;                         | 4 if (x == 1)<br>print("42");                 |
| Can it print "Answer is: 42"    | ? Yes, eg: 1234                               |
| How about just "42"?            | Yes, eg: 1342                                 |
| Could it print <b>nothing</b> ? |                                               |

Could it print nothing?



| Thread 0                        | Thread 1                        |
|---------------------------------|---------------------------------|
| $1 \times = 1;$                 | <b>3</b> if (y == 1)            |
|                                 | <pre>print("Answer is:");</pre> |
| 2y = 1;                         | 4 if $(x == 1)$                 |
|                                 | print("42");                    |
| Can it print <b>"Answer is:</b> | 42"? Yes, eg: 1234              |
| How about just "42"?            | Yes, eg: 1342                   |
| Could it print nothing?         | Yes, eg: 3412                   |



| Thread 0                        | Thread 1                        |
|---------------------------------|---------------------------------|
| $1 \times = 1;$                 | <b>3</b> if (y == 1)            |
|                                 | <pre>print("Answer is:");</pre> |
| 2y = 1;                         | 4 if $(x == 1)$                 |
|                                 | print("42");                    |
| Can it print "Answer is: 42     | 2"? Yes, eg: 1234               |
| How about just "42"?            | Yes, eg: 1342                   |
| Could it print <b>nothing</b> ? | Yes, eg: 3412                   |

These executions obey **Sequential Consistency (SC)** [Lamport79], which requires that the results of the overall program correspond to some in-order interleaving of the statements from each individual thread.

| Thread O        | Thread 1                                     |
|-----------------|----------------------------------------------|
| $1 \times = 1;$ | <pre>3if (y == 1) print("Answer is:");</pre> |
| <b>2</b> y = 1; | <pre>4 if (x == 1) print("42");</pre>        |

How about "Answer is:"?





| Thread 0        | Thread 1                                     |
|-----------------|----------------------------------------------|
| $1 \times = 1;$ | <pre>3if (y == 1) print("Answer is:");</pre> |
| <b>2</b> y = 1; | <pre>4 if (x == 1)</pre>                     |
|                 |                                              |

How about "Answer is:"? It depends!



| Thread 0        | Thread 1                                      |
|-----------------|-----------------------------------------------|
| $1 \times = 1;$ | <pre>3 if (y == 1) print("Answer is:");</pre> |
| 2y = 1;         | <pre>4 if (x == 1) print("42");</pre>         |

How about "Answer is:"? It depends!

2134



| Thread O        | Thread 1                                     |
|-----------------|----------------------------------------------|
| $1 \times = 1;$ | <pre>3if (y == 1) print("Answer is:");</pre> |
| 2y = 1;         | <pre>4 if (x == 1) print("42");</pre>        |

2134

How about "Answer is:"? It depends!





Message Passing (mp)

#### **Answer: Performance!**







Answer: Performance!Core 0Core 1x = 1;r1 = y;y = 1;r2 = x;Core 0Core 1x = 1;r1 = y;y = 1;r2 = x;Can r1=1 and r2=0?

Can improve performance by sending both stores to memory in parallel







x = 1;

y = 1;

Core 0

Message Passing (mp)

Core 1

r1 = y;

Core 0

x = 1;

#### **Answer: Performance!**







Message Passing (mp)

#### **Answer: Performance!**







Message Passing (mp) Core 0 Core 1 **Answer: Performance!** r1 = y; x = 1;y = 1;r2 = x;Can r1=1 and r2=0? Core 0 Core 1 r1 = y = 1;**X** =  $r^2 = x = 0$ : Cache By the time store of x is **x: 1** y: 1 complete, Core 1 has observed reordering! Memory



Message Passing (mp)

#### **Answer: Performance!**

| Core Ø   | Core 1    |
|----------|-----------|
| x = 1;   | r1 = y;   |
| y = 1;   | r2 = x;   |
| Can r1=1 | and r2=0? |

Fence/synchronization instructions can enforce order between memory operations where needed





- Compiler optimizations can also result in weak memory behaviours
  - Example below: assume CPU performs instrs in order and 1 at a time

| Thread 0                         | Thread 1               |
|----------------------------------|------------------------|
| 1 x = 1;<br>2 y = 1;<br>3 x = 2; | 4 r1 = y;<br>5 r2 = x; |
| Can r1 = 1 and r2 = 0?           |                        |



- Compiler optimizations can also result in weak memory behaviours
  - Example below: assume CPU performs instrs in order and 1 at a time

Compiler may coalesce these 2 stores (since no same-thread reads of x in between)

| Thread 0                         | Thread 1               |
|----------------------------------|------------------------|
| 1 x = 1;<br>2 y = 1;<br>3 x = 2; | 4 r1 = y;<br>5 r2 = x; |
| Can r1 = 1                       | and r2 = 0?            |



- Compiler optimizations can also result in weak memory behaviours
  - Example below: assume CPU performs instrs in order and 1 at a time

| Thread 0               | Thread 1               |
|------------------------|------------------------|
| 2 y = 1;<br>3 x = 2;   | 4 r1 = y;<br>5 r2 = x; |
| Can r1 = 1 and r2 = 0? |                        |



- Compiler optimizations can also result in weak memory behaviours
  - Example below: assume CPU performs instrs in order and 1 at a time

| Thread 0               | Thread 1               |
|------------------------|------------------------|
| 2 y = 1;<br>3 x = 2;   | 4 r1 = y;<br>5 r2 = x; |
| Can r1 = 1 and r2 = 0? |                        |

### Now **2458** gives r1 = 1 and r2 = 0!



- ISA instructions represent hardware operations (add, sub, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops

Compiler

Microarchitecture<sup>1</sup>



- ISA instructions represent hardware operations (add, sub, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops



Microarchitecture<sup>1</sup>



- ISA instructions represent hardware operations (add, sub, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops



Microarchitecture<sup>1</sup>

How much can I buffer and reorder memory operations?

- ISA instructions represent hardware operations (add, sub, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops



- ISA instructions represent hardware operations (add, sub, ld, st, ...)
- MCMs similarly represent the orderings among hardware memory ops

# In a nutshell: MCMs specify what value will be returned when your program does a load!





Memory Consistency Models (MCMs) Specify rules and guarantees about the <u>ordering</u> and <u>visibility</u> of accesses to shared memory [Sorin et al., 2011].



Memory Consistency Models (MCMs)

Specify rules and guarantees about the <u>ordering</u> and <u>visibility</u> of accesses to shared memory [Sorin et al., 2011].



Memory Consistency Models (MCMs) Specify rules and guarantees about the <u>ordering</u> and <u>visibility</u> of accesses to shared memory [Sorin et al., 2011].



Memory Consistency Models (MCMs) Specify rules and guarantees about the <u>ordering</u> and <u>visibility</u> of accesses to shared memory [Sorin et al., 2011].



## How are MCMs specified?

- Natural language?
  - E.g. Sequential Consistency [Lamport 1979]

"The result of any execution is the same as if the **operations of all the processors were executed in some sequential order**, and the **operations of each individual processor appear** in this sequence **in the order specified by its program**."

What about more complicated models?



## How are MCMs specified?

#### Excerpt from the ARMv8 manual (memory model section):

#### Architecturally well-formed

An architecturally well-formed execution must satisfy both of the following requirements:

#### Internal visibility requirement

For a read or a write  $RW_1$  that appears in program order before a read or a write  $RW_2$  to the same Location, the internal visibility requirement requires that exactly one of the following statements is true:

- RW<sub>2</sub> is a write W<sub>2</sub> that is Coherence-after RW<sub>1</sub>.
- $RW_1$  is a write  $W_1$  and  $RW_2$  is a read  $R_2$  such that either:
  - R<sub>2</sub> Reads-from W<sub>1</sub>.
  - R<sub>2</sub> Reads-from another write that is Coherence-after W<sub>1</sub>.
- $RW_1$  and  $RW_2$  are both reads  $R_1$  and  $R_2$  such that  $R_1$  Reads-from a write  $W_3$  and either:
  - R<sub>2</sub> Reads-from W<sub>3</sub>.
  - R<sub>2</sub> Reads-from another write that is Coherence-after W<sub>3</sub>.

#### 

If a Memory effect  $M_1$  from an Observer appears in program order before a Memory effect  $M_2$  from the same Observer, then  $M_1$  will be seen to occur before  $M_2$  by that Observer.



- ISA-level MCMs defined using relational patterns [Shasha and Snir TOPLAS 1988]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is *acyclic*(*po* ∪ *co* ∪ *rf* ∪ *fr*)

(i1) 
$$\stackrel{\text{po}}{\longrightarrow}$$
 (i2)  $\stackrel{\text{rf}}{\longrightarrow}$  (i3)  $\stackrel{\text{po}}{\longrightarrow}$  (i4)

| Message passing | (mp | ) litmus test |
|-----------------|-----|---------------|
|-----------------|-----|---------------|

| Core 0                     | Core 1                       |
|----------------------------|------------------------------|
| (i1) x = 1;<br>(i2) y = 1; | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r              | 1 = 1, r2 = 0                |

Legend: po = Program order co = Coherence order rf = Reads-from fr = From-reads

- ISA-level MCMs defined using relational patterns [Shasha and Snir TOPLAS 1988]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is *acyclic*(*po* ∪ *co* ∪ *rf* ∪ *fr*)

(i1) 
$$\stackrel{\text{fr}}{\longrightarrow}$$
 (i2)  $\stackrel{\text{rf}}{\longrightarrow}$  (i3)  $\stackrel{\text{po}}{\longrightarrow}$  (i4)

#### Message passing (mp) litmus test

| Core 0                     | Core 1                       |
|----------------------------|------------------------------|
| (i1) x = 1;<br>(i2) y = 1; | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r              | 1 = 1, r2 = 0                |

#### Legend: po = Program order co = Coherence order rf = Reads-from fr = From-reads

- ISA-level MCMs defined using relational patterns [Shasha and Snir TOPLAS 1988]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is *acyclic*(*po* ∪ *co* ∪ *rf* ∪ *fr*)



#### Message passing (mp) litmus test

| Core 0                     | Core 1                       |
|----------------------------|------------------------------|
| (i1) x = 1;<br>(i2) y = 1; | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r              | 1 = 1, r2 = 0                |

#### Legend:

- po = Program order co = Coherence order
- rf = Reads-from
- fr = From-reads

- ISA-level MCMs defined using relational patterns [Shasha and Snir TOPLAS 1988]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is *acyclic*(*po* ∪ *co* ∪ *rf* ∪ *fr*)

(i1) 
$$\stackrel{\text{po}}{\longrightarrow}$$
 (i2)  $\stackrel{\text{rf}}{\longrightarrow}$  (i3)  $\stackrel{\text{po}}{\longrightarrow}$  (i4)

| Message | passing | (mp) | litmus test |
|---------|---------|------|-------------|
| Incouge |         |      |             |

| Core 0                     | Core 1                       |
|----------------------------|------------------------------|
| (i1) x = 1;<br>(i2) y = 1; | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r              | 1 = 1, r2 = 0                |

| Legend:      |
|--------------|
| po = Program |
| - ·          |

co = Coherence order

order

- rf = Reads-from
- fr = From-reads

- ISA-level MCMs defined using relational patterns [Shasha and Snir TOPLAS 1988]
- ISA-level executions are graphs
  - nodes: instructions, edges: ISA-level relations
- Eg: SC is *acyclic*(*po* ∪ *co* ∪ *rf* ∪ *fr*)

- Formal specifications of ISA + HLL MCMs
  - x86 [Owens et al. TPHOLS2009], ARM [Pulte et al. POPL2018], C11 [Batty et al. POPL 2011], ...
- Automated formal tools e.g. herd [Alglave et al. TOPLAS 2014]

(i1)  $\stackrel{\text{po}}{\longrightarrow}$  (i2)  $\stackrel{\text{rf}}{\longrightarrow}$  (i3)  $\stackrel{\text{po}}{\longrightarrow}$  (i4)

• Can formally analyse small test programs against these models

| IVIESSAGE passing (mp) litmus test |                              |
|------------------------------------|------------------------------|
| Core 0                             | Core 1                       |
| (i1) x = 1;<br>(i2) y = 1;         | (i3) r1 = y;<br>(i4) r2 = x; |
| SC Forbids: r                      | $1 = 1, r^2 = 0$             |

| Legend:              |
|----------------------|
| po = Program order   |
| co = Coherence order |
| rf = Reads-from      |
| fr = From-reads      |

- MCM specified at an interface between layers of the stack
- Upper layers target the MCM; lower layers must maintain it!





- MCM specified at an interface between layers of the stack
- Upper layers target the MCM; lower layers must maintain it!



- MCM specified at an interface between layers of the stack
- Upper layers target the MCM; lower layers must maintain it!



- MCM specified at an interface between layers of the stack
- Upper layers target the MCM; lower layers must maintain it!









Our Approach

- Axiomatic specifications -> Happens-before graphs
- Check Happens-Before Graphs via Efficient SMT solvers
  - <u>Cyclic</u> => A->B->C->A... Can't happen
  - <u>Acyclic</u> => Scenario is observable





#### <u>Our Approach</u>

- Axiomatic specifications -> Happens-before graphs
- Check Happens-Before Graphs via Efficient SMT solvers
  - <u>Cyclic</u> => A->B->C->A... Can't happen
  - <u>Acyclic</u> => Scenario is observable



- Cyclic => A->B->C->A... Can't happen
- <u>Acyclic</u> => Scenario is observable

For more

SpectrePrime, MeltdownPrime

C++ 11 mem model

## In a nutshell, our tool philosophy...

- Automate specification, verification, and translation related to MCMs
- Comprehensive exploration of ordering possibilities
- Key Techniques: Happens-before Graphs and SMT solvers
- Initially: Litmus-test driven (small test programs, 4-8 instrs)
- Now: PipeProof demonstrates complete (i.e. all-program) analysis



## Outline

- Overview, Motivation, and MCM Background (15 minutes) (mm)
- PipeCheck: Verifying Microarchitectural Implementations against ISA Specs (45 minutes)
  - Includes hands-on of using uSpec DSL for specifying axioms (30 minutes) (ym)
- PipeProof: Beyond Litmus Tests (45 minutes) (ym)
  - Includes hands-on of proving simple microarch. across all programs (25 minutes)

- Coffee Break. 11-11:20
- Up and Down the Stack
  - RTLCheck (15 minutes) (ym)
  - TriCheck (10 minutes) (ct)
- Looking forward: Other uses of tools and techniques
  - CheckMate for security (25 minutes) (ct)
- Conclusions and Bigger Picture (10 minutes)

